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Structured Knowledge Representation for Image Retrieval 



We propose a structured approach to the problem of retrieval of images by content and 
present a description logic that has been devised for the semantic indexing and retrieval of 
images containing complex objects. 

As other approaches do, we start from low-level features extracted with image analysis 

to detect and characterize regions in an image. However, in contrast with feature-based ap- 
proaches, we provide a syntax to describe segmented regions as basic objects and complex 
objects as compositions of basic ones. Then we introduce a companion extensional seman- 
tics for defining reasoning services, such as retrieval, classification, and subsumption. These 
services can be used for both exact and approximate matching, using similarity measures. 

Using our logical approach as a formal specification, we implemented a complete client- 
server image retrieval system, which allows a user to pose both queries by sketch and queries 
by example. A set of experiments has been carried out on a testbed of images to assess the 
retrieval capabilities of the system in comparison with expert users ranking. Results are 
presented adopting a well-established measure of quality borrowed from textual information 
retrieval. 

1. Introduction 

Image retrieval is the problem of selecting, from a repository of images, those images ful- 
filling to the maximum extent some criterion specified by an end user. In this paper, we 
concentrate on content-based image retrieval, in which criteria express properties of the 
appearance of tlie image itself, i.e., on its pictorial characteristics. 

Most of the research in this field has till now concentrated in devising suitable techniques 
for extracting relevant cues with the aid of image analysis algorithms. Current systems result 
effective when the specified properties are so-called low-level characteristics, such as color 
distribution, or texture. For example, systems such as IBM's QBIC^ can easily retrieve, 
among others, stamps containing the picture of a brown horse in a green field, when asked 
to retrieve images of stamps with brown central area over a greenish background. 

Nevertheless, present systems fail at treating correctly high-level characteristics of an 
image — such as, "retrieve stamps with a galloping horse" . First of all, most systems cannot 
even allow the user to specify such queries, because they lack a language for expressing high- 
level features. Usually, this is overcome with the help of examples: "retrieve images similar 
to this one" . However, examples are quite ambiguous to interpret: which are the features 
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in the example that should appear in retrieved images? This ambiguity produces a lot of 
"false positives", as any one can experience. 

Even if relevant features are pointed out in the example, the system cannot tell whether 
what is pointed out is the color distribution, or its interpretation — after all, a galloping 
brown horse produces a color distribTition which is more similar to a running brown fox than 
to a galloping white horse. In this aspect, image retrieval faces the same problems of object 
recognition, which is a central problem in robotics and artificial vision. The only effective 
solution overcoming this problem is to associate to a query some significant keywords, which 
should match keywords attached in some way to images in the repository. Here ambiguities 
in image understanding are just transferred to text understanding, as now a brown portrait 
of Crazy Horse — the famous Indian chief — could be considered relevant. 

Resorting to human experts to specify the expected output of a retrieval algorithm can, 
in our opinion, only worsen these ambiguities, since it makes the correctness of an approach 
to depend from a subjective perception of what an image retrieval system should do. What 
is needed is a formal, high-level specification of the image retrieval task. This need motivates 
the research we report here. 

1.1 Contributions of the Paper 

We approach the problem of image retrieval from a knowledge representation perspective, 
and in particular, we refer to a framework already successfully applied by Woods and 
Schmolze (1992) to conceptual modeling and semantic data models in databases (Calvanese, 
Lenzerini, & Nardi, 1998). We consider image retrieval as a knowledge representation 
problem, in which we can distinguish the following aspects: 

Interface: the user is given a simple visual language to specify (by sketch or by example) 
a geometric composition of basic shapes, which we call description. The composite shape 
description intuitively stands for a set of images (all containing the given shapes in their 
relative positions); it can be used either as a query, or as an index for a relevant class of 
images, to be given some meaningful name. 

Syntax and semantics: the system has an internal syntax to represent the user's 
queries and descriptions, and the syntax is given an extensional semantics in terms of sets 
of retrievable images. In contrast with existing image retrieval systems, our semantics is 
compositional, in the sense that adding details to the sketch may only restrict the set of 
retrievable images. Syntax and semantics constitute a Semantic Data Model, in which the 
relative position, orientation and size of each shape component are given an explicit no- 
tation through a geometric transformation. The extensional semantics allows us to define 
a hierarchy of composite shape descriptions, based on set containment between interpre- 
tations of descriptions. Coherently, the recognition of a shape description in an image is 
defined as an interpretation satisfying the description. 

Algorithms and complexity: based on the semantics, we prove that subsumption 
between descriptions can be carried out in terms of recognition. Then we devise exact and 
approximate algorithms for composite shapes recognition in an image, which are correct with 
respect to the semantics. Ideally, if the computational complexity of the problem of retrieval 
was known, the algorithms should also be optimal with reference to the computational 
complexity of the problems. Presently, we solved the problem for exact retrieval, and 
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propose an algorithm for approximate retrieval which, although probably non-optimal, is 

correct. 

Experiments: while the study of the complexity of the problem is ongoing, the syntax, 
semantics, and sub-optimal algorithms obtained so far are already sufficient to provide the 
formal specification of a prototype system for the experimental verification of our approach. 

The prototype has been used to carry out a set of experiments on a test database of images, 
which allowed us to verify the effectiveness of the proposed approach in comparison with 
expert users ranking. 

We believe that a knowledge representation approach brings several benefits to research 
in image retrieval. First of all, it separates the problem of finding an intuitive semantics for 
query languages in image retrieval from the problem of implementing a correct algorithm 
for a given semantics. Secondly, once the problem of image retrieval is semantically formal- 
ized, results and techniques from Computational Geometry can be exploited in assessing 
the computational complexity of the formalized retrieval problem, and in devising efficient 
algorithms, mostly for the approximate image retrieval problem. This is very much in the 
same spirit as finite model theory has been used in the study of complexity of query answer- 
ing for relational databases (Chandra &l Harel, 1980). Third, our language borrows from 
object modeling in Computer Graphics the hierarchical organization of classes of images 
(Foley, van Dam, Feiner, & Hughes, 1996). This, in addition to an interpretation of compos- 
ite shapes which one can immediately visualize, opens our logical approach to retrieval of 
images of 3D-objects constructed in a geometric language (Paquet & Rioux, 1998), which is 
still to be explored. Fourth, our logical formalization, although simple, allows for extensions 
which are natural in logic, such as disjunction (OR) of components. Although alternative 
components of a complex shape are difficult to be shown in a sketch, they could be used 
to specify moving {i.e., non-rigid) parts of a composite shape. This exemplifies how our 
logical approach can shed light to extensions of our syntax suitable for, e.g., video sequence 
retrieval. 

1.2 Outline of the Paper 

The rest of the paper is organized as follows. In the next section, we review related work 
on image retrieval. In Section 3 we describe our formal language, first its syntax, then its 
semantics, and we start proving some basic properties. In the following section, we analyze 
the reasoning problems and the semantic relations among them, and we devise algorithms 
that can solve them. Then in Section 5 we illustrate the architecture of our system and 
propose some examples pointing out distinguishing aspects of our approach. In Section 6 
we present a set of experiments to assess retrieval capabilities of the system. Last section 
draws the conclusions and proposes directions for future work. 

2. Related Work 

Content-Based Image Retrieval (CBIR) has recently become a widely investigated research 
area. Several systems and approaches have been proposed; here we briefly report on some 
significant examples and categorize them in three main research directions. 
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2.1 Feature-based Approaches 

Largest part of research on CBIR has focused on low-level features such as color, texture, 
shape, which can be extracted using image processing algorithms and used to characterize 
an image in some feature space for subsequent indexing and similarity retrieval. In this 
way the problem of retrieving images with homogeneous content is substituted with the 
problem of retrieving images visually close to a target one (Hirata & Kato, 1992; Niblak 
et al., 1993; Picard &: Kabir, 1993; Jacobs, Finkelsteiri, & Salesin, 1995; Flickner et al., 
1995; Bach, Fuller, Gupta, Hampapur, Horowitz, Humphrey, Jain, & Shu, 1996; Celentano 
& Di Sciascio, 1998; Cox, Miller, Minka, & Papathomas, 2000; Gevers & Smeulders, 2000). 

Among the various projects, particularly interesting is the QBIC system (Niblak et al., 
1993; Flickner et al., 1995), often cited as the ancestor of all other CBIR systems, which 
allows queries to be performed on shape, texture, color, by example and by sketch using as 
target media both images and shots within videos. The system is currently embedded as a 
tool in a commercial product, Ultimedia Manager. Later versions have introduced an 
automated foreground/background segmentation scheme. Here the indexing of an image is 
made on the principal shape, with the aid of some heuristics. This is an evident limitation: 
most images do not have a main shape, and objects are often composed of various parts. 

Other researchers, rather than concentrating on a main shape, which is typically as- 
sumed located in the central part of the picture, have proposed to index regions in images; 
so that the focus is not on retrieval of similar images, but of similar regions within an 
image. Examples of this idea are VisualSeek (Smith & Chang, 1996), NETRA (Ma 
Sz Manjunath, 1997) and Blobworld (Carson, Thomas, Belongie, Hellerstein, & Malik, 
1999). The problem is that although all these systems index regions, they lack of a higher 
level description of images. Hence, they are not able to describe — and hence query for — 
more than a single region at a time in an image. 

In order to improve retrieval performances, m^icli attention has been paid in recent 
years to relevance feedback. Relevance feedback is the mechanism, widely used in textual 
information systems, which allows improving retrieval effectiveness by incorporating the 
user in the query-retrieval loop. Depending on the initial query the system retrieves a set 
of documents that the user can mark either as relevant or irrelevant. The system, based on 
the user preferences, refines the initial query retrieving a new set of documents that should 
be closer to the user's information need. 

This issue is particularly relevant in feature-based approaches, as on one hand, the user 
lacks of a language to express in a powerful way her information need, but on the other 
hand, deciding whether an image is relevant or not takes just a glance. Examples of systems 
using relevance feedback include MARS (Rui, Huang, & Mehrotra, 1997), DrawSearch 
(Di Sciascio & Mongiello, 1999) and PicHunter (Cox et al., 2000). 

2.2 Approaches Based on Spatial Constraints 

This type of approach to the problem of image retrieval concentrates on finding the simi- 
larity of images in terms of spatial relations among objects in them. Usually the emphasis 
is only on relative positions of objects, which are considered as "symbolic images" or icons, 
identified with a single point in the 2D-space. Information on the content and visual ap- 
pearance of images are normally neglected. 
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Chang, Shi, and Yan (1983) present the modehng of this type of images in terms of 
2D-strings, each of the strings accounting for the position of icons along one of the two 
planar dimensions. In this approach retrieval of images basically reverts to simpler string 
matching. 

Gudivada and Raghavan (1995) consider the objects in a symbolic image associated 
with vertexes in a weighted graph. Edges — i.e., lines connecting the centroids of a pair of 
objects — represent the spatial relationships among the objects and are associated with a 
weight depending on their slope. The symbolic image is represented as an edge list. Given 
the edge lists of a query and a database image, a similarity function computes the degree of 
closeness between the two lists as a measin-e of the matching between the two spatial-graphs. 
The similarity measure depends on the number of edges and on the comparison between 
the orientation and slope of edges in the two spatial-graphs. The algorithm is robust with 
respect to scale and translation variants in the sense that it assigns the highest similarity to 
an image that is a scale or translation variant of the query image. An extended algorithm 
includes also rotational variants of the original images. 

More recent papers on the topic include those by Gudivada (1998) and by El-Kwae 
and Kabuka (1999), which basically propose extensions of the strings approach for efficient 
retrieval of subsets of icons. Gudivada (1998) defines 0R-strings, a logical representation of 
an image. Such representation also provides a geometry-based approach to iconic indexing 
based on spatial relationships between the iconic objects in an image individuated by their 
centroid coordinates. Translation, rotation and scale variant images and the variants gener- 
ated by an arbitrary composition of these three geometric transformations are considered. 
The approach does not deal with object shapes, nor with other basic image features, and 
considers only the sequence of the names of the objects. The concatenation of the objects is 
based on the euclidean distance of the domain objects in the image starting from a reference 
point. The similarity between a database and a query image is obtained through a spatial 
similarity algorithm that measures the degree of similarity between a query and a database 
image by comparing the similarity between their 6'R-strings. The algorithm recognizes ro- 
tation, scale and translation variants of the image and also subimages, as subsets of the 
domain objects. A constraint limiting the practical use of this approach is the assumption 
that an image can contain at most one instance of each icon or object. 

El-Kwae and Kabuka (1999) propose a further extension of the spatial-graph approach, 
which includes both the topological and directional constraints. The topological extension of 
the objects can be obviously useful in determining further differences between images that 
might be considered similar by a directional algorithm that considers only the locations 
of objects in term of their centroids. The similarity algorithm they propose extends the 
graph-matching one previously described by Gudivada and Raghavan (1995). The similarity 
between two images is based on three factors: the number of common objects, the directional 
and topological spatial constraint between the objects. The similarity measure includes the 
number of objects, the number of common objects and a function that determines the 
topological difference between corresponding objects pairs in the query and in the database 
image. The algorithm retains the properties of the original approach, including its invariance 
to scaling, rotation and translation and is also able to recognize multiple rotation variants. 
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2.3 Logic-based and Structured Approaches 

With reference to previous work on Vision in Artificial Intelligence, the use of structural 
descriptions of objects for the recognition of their images can be dated back to Minsky's 
frames, and some work by Brooks (1981). The idea is to associate parts of an object (and 
generally of a scene) to the regions an image can be segmented into. The hierarchical 
organization of knowledge to be used in the recognition of an object was first proposed by 
Marr (1982). Reiter and Mackworth (1989) proposed a formalism to reason about maps as 
sketched diagrams. In their approach, the possible relative positions of lines are fixed and 
highly qualitative (touching, intersecting). 

Structured descriptions of three-dimensional images are already present in languages 
for virtual reality like VRML (Hartrnan & Wernecke, 1996) or hierarchical object mod- 
eling. However, the semantics of such languages is operational, and no effort is made to 
automatically classify objects with respect to the structure of their appearance. 

Meghini, Sebastiani, and Straccia (2001) proposed a formalism integrating Description 
Logics and image and text retrieval, while Haarslev, Lutz, and Moeller (1998) integrate 
Description Logics with spatial reasoning. Further extensions of the approach are described 
by Moeller, Neumann, and Wessel (1999). Both proposals build on the clean integration of 
Description Logics and concrete domains of Baader and Hanschke (1991). However, neither 
of the formalisms can be used to build complex shapes by nesting more simple shapes. 
Moreover, the proposal by Haarslev et al. (1998) is based on the logic of spatial relations 
named RCC8, which is enough for specifying meaningful relations in a map, but it is too 
qualitative to specify the relative sizes and positions of regions in a complex shape. 

Also for Hacid and Rigotti (1999) description logics and concrete domains are at the 
basis of a logical framework for image databases aimed at reasoning on query containment. 
Unfortunately, the proposed formalism cannot consider geometric transformations neither 
determine specific arrangements of shapes. 

More similar to our approach is the proposal by Ardizzone, Chella, and Gaglio (1997), 
where parts of a complex shape are described with a description logic. However, the com- 
position of shapes does not consider their positions, hence reasoning cannot take positions 
into account. 

Relative position of parts of a complex shape can be expressed in a constraint relational 
calculus in the work by Bertino and Catania (1998). However, reasoning about queries 
(containment and emptiness) is not considered in this approach. Aiello (2001) proposes a 
multi-modal logic, which provides a formalism for expressing topological properties and for 
defining a distance measure among patterns. 

Spatial relation between parts of medical tomographic images are considered by Tagare, 
Vos, JafFe, and Duncan (1995). There, medical images are formed by the intersection of the 
image plane and an object. As the image plane changes, different parts of the object are 
considered. Besides, a metric for arrangements is formulated by expressing arrangements 
in terms of the Voronoi diagram of the parts. The approach is limited to medical image 
databases and does not provide geometrical constraints. 

Compositions of parts of an image are considered in the work by SanfeliTi and Fu 
(1983) for character recognition. However, in recognizing characters, line compositions 
are "closed" , in the sense that one looks for the specified lines, and no more. Instead in our 



214 



Structured Knowledge Representation for Image Retrieval 



framework, the shape "F" composed by three hnes, is subsumed by the shape "F" — some- 
thing unacceptable in recognizing characters. Apart from the different task, this approach 
does not make use of an extensional semantics for composite shapes, hence no reasoning is 
possible. 

A logic-based multimedia retrieval system was proposed by Fuhr, Govert, and Rolleke 
(1998); the method, based on an object-oriented logic, supports aggregated objects but it 
is oriented towards a high-level semantic indexing, which neglects low-level features that 
characterize images and parts of them. 

In the field of computation theories of recognition, we mention two approaches that have 
some resemblance to our own: Biederman's structural decomposition and geometric con- 
straints proposed by Ullman, both described by Edelmann (1999). Unfortunately, neither of 
them appears suitable for realistic image retrieval: the structural decomposition approach 
does not consider geometric constraints between shapes, while the approach based on geo- 
metric constraints does not consider the possibility of defining structural decomposition of 
shapes, hence reasoning on them. 

Starting with the reasonable assumption that the recognition of an object in a scene 
can be eased by previous knowledge on the context, in the work by Pirri and Finzi (1999), 
the recognition task, or the interpretation of an image, takes advantage of the information 
a cognitive agent has about the environment, and by the representation of these data in a 
high-level formalism. 



3. SyntcLx and Semantics 

In this section we present the formalism dealing with the definition of composite shape de- 
scriptions, their semantics, and some properties that distinguish our approach from previous 
ones. 

We remark that our formalism deals with image features, like shape, color, texture, but 
is independent of the way features are extracted from actual images. For the interested 
reader, the algorithms we used to compute image features in our implementation of the 
formalism are presented in the Appendix. 



3.1 Syntax 

Our main syntactic objects are basic shapes, position of shapes, composite shape descrip- 
tions, and transformations. We also take into account the other features that typically 
determine the visual appearance of an image, namely color and texture. 

Basic shapes are denoted with the letter B, and have an edge contour e{B) characterizing 
them. We assume that e{B) is described as a single, closed 2D-curve in a space whose origin 
coincides with the centroid of B. Examples of basic shapes can be circle, rectangle, with 
the contours e(circle) = Q, e(rectaiigle) = | , but also any complete, rough contour 
— e.g., the one of a ship — is a basic shape. To make our language compositional, we 
consider only the ex ternal contour of a region. For example, if a region is contained in 



another, as in Q ■. the contour of the outer region is just the external rectangle. 

The possible transformations are the simple ones that are present in any drawing tool: 
rotation (around the centroid of the shape), scaling and translation. We globally denote a 
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Figure 1: The graphical interface with a query by sketch. 



rotation-translation-scahng transformation as r. Recall that transformations can be com- 
posed in sequences tio . . . ot„, and they form a mathematical group. 

The basic building block of our syntax is a basic shape component {c,t,T,B), which 
represents a region with color c, texture t, and edge contour T{e{B)). With T{e{B)) we 
denote the pointwise transformation r of the whole contour of B. For example, r could 
specify to place the contour e{B) in the upper left corner of the image, scaled by 1/2 and 
rotated 45 degrees clockwise. 

Composite shape descriptions are conjunctions of basic shape components — each one 
with its own color and texture — denoted as 

C = (ci, ti, n, Bi) n ■ ■ • n (c„, t„, t„, b„) 

We do not expect end users of our system to actually define composite shapes with this 
syntax; this is just the internal representation of a composite shape. The system can 
maintain it while the user draws — with the help of a graphic tool — the complex shape by 
dragging, rotating and scaling basic shapes chosen either from a palette, or from existing 
images (see Figure 1). 

For example, the composite shape lighted-candle could be defined as 

lighted-candle = (ci, ti, n, rectangle) □ (c2, t2, ^2, circle) 
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with Ti, T2 placing the circle as a flame on top of the candle, and textures and colors defined 
accordingly to the intuition. 

We remark that, to the best of our knowledge, the logic we present is the first one 
combining shapes and explicit transformations in one language. 

In a previous paper (Di Sciascio, Donini, &: Mongiello, 2000) we presented a formalism 
including nested composite shapes, as it is done in hierarchical object modeling (Foley et al., 
1996, Ch.7). However, nested composite shapes can always be fiattened by composing their 
transformations. Hence in this paper we focus on two levels: basic shapes and compositions 
of basic shapes. Also, just to simplify the presentation of the semantics, in the following 
section we do not present color and texture features, which we take into account from 
Section 4.2 on. 

3.2 Semantics 

We consider an extensional semantics, in which syntactic expressions are interpreted as 
subsets of a domain. For our setting, the domain of interpretation is a set of images A, and 
shapes and components are interpreted as subsets of A. Hence, also an image database is 
a domain of interpretation, and a complex shape C is a subset of such a domain — the 
images to be retrieved from the database when C is viewed as a query. 

This approach is quite different from previous logical approaches to image retrieval that 
view the image database as a set of facts, or logical assertions, e.g., the one based on 
Description Logics by Meghini et al. (2001). In that setting, image retrieval amounts to 
logical inference. However, observe that usually a Domain Closure Assumption (Reiter, 
1980) is made for image databases: there are no regions but the ones which can be seen in 
the images themselves. This allows one to consider the problem of image retrieval as simple 
model checking — check if a given structure satisfies a description^. 

Formally, an interpretation is a pair (X, A), where A is a set of images, and T is a 
mapping from shapes and components to subsets of A. We identify each image / with the 
set of regions {ri, . . . ,r„} it can be segmented into (excluding background, which we discuss 
at the end of this section). Each region r comes with its own edge contour e(r). An image 
I E A belongs to the interpretation of a basic shape component (r, B)-^ if I contains a 
region whose contour matches T{e{B)). In formulae, 

(r, Bf = {I eA\3reI : e(r) = T{e{B))} (1) 

The above definition is only for exact recognition of shape components in images, due to 
the presence of strict equality in the comparison of contours; but it can be extended to 
approximate recognition as follows. Recall that the characteristic function fs of a set 5 is a 
function whose value is either 1 or 0; fs{x) = 1 if a; £ S", fs{x) = otherwise. We consider 
now the characteristic function of the set defined in Formula (1). Let / be an image; if / 
belongs to (r, B)^, then the characteristic function computed on / has value 1, otherwise 
it has value 0. To keep the number of symbols low, we use the expression (r, B)^ also to 

2. Obviously, a Domain Closure Assumption on regions is not valid in artificial vision, dealing with two- 
dimensional images of tfiree-dimensional shapes (and scenes), because solid shapes have surfaces that 
will be hidden in their images. But this is outside the scope of our retrieval problem. 
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denote the characteristic function (with an argument (7) to distinguish it from the set) . 

I. B\^rn = / 1 if3re/:e(r) = r(e(B)) 
^ ' ' ^ ' I otherwise 

Now we reformulate this function in order to make it return a real number in the range [0, 1] 
— as usual in fuzzy logic (Zadeh, 1965). Let sim{-, •) be a similarity measure from pairs of 
contours into the range [0, 1] of real numbers (where 1 is perfect matching). We use sim{-, ■) 
instead of equality to compare edge contours. Moreover, the existential quantification can 
be replaced by a maximum over all possible regions in /. Then, the characteristic function 
for the approximate recognition in an image 7 of a basic component, is: 

(r, Bf{I) = max{5im(e(r), T(e(B)))} 

Note that sim depends on translations, rotation and scaling, since we are looking for regions 
in / whose contour matches e{B), with reference to the position and size specified by r. 

The interpretation of basic shapes, instead, includes a translation-rotation-scaling in- 
variant recognition, which is commonly used in single-shape Image Retrieval. We define the 
interpretation of a basic shape as 

5^ = {I e A I 3r 3r e / : e(r) = T{e{B))} 

and its approximate counterpart as the function 

B^{I) = maxmax{sim(e(r), r(e(i?)))} 

The maximization over all possible transformations max^ can be effectively computed by 
using a similarity measure simss that is invariant with reference to translation-rotation- 
scaling (see Section 4.2). Similarity of color and texture will be added as a weighted sum 
in Section 4.2. In this way, a basic shape B can be used as a query to retrieve all images 
from A which are in B^ . Therefore, our approach generalizes the more usual approaches 
for single-shape retrieval, such as Blobworld (Carson et al., 1999). 

Composite shape descriptions are interpreted as sets of images that contain all com- 
ponents of the composite shape. Components can be anywhere in the image, as long as 
they are in the described arrangement relative to each other. Let C be a composite shape 
description (ri, Bi) fl ■ • ■ Fl (t„, In exact matching, the interpretation is the intersection 
of the sets interpreting each component of the shape: 

= {/ e A I 3t : / e nUiir o n), B^f} (2) 

Observe that we require all shape components of C to be transformed into image regions 
using the same transformation r. This preserves the arrangement of the shape components 
relative to each other — given by each — while allowing C-^ to include every image 
containing a group of regions in the right arrangement, wholly displaced by r. 

To clarify this formula, consider Figure 2: the shape C is composed by two basic shapes 
Bi and B2, suitably arranged by the transformations ri and T2. Suppose now that A 
contains the image /. Then, I e C-^ because there exists the transformation r, which 
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Figure 2; An example of application of Formula (2). 



globally brings C into I, that is, r o ti brings the rectangle Bi into a rectangle recognized 
in /, and r o t2 brings the circle B2 into a circle recognized in /, both arranged according 
to C. Note that / could contain also other shapes, not included in C. 
We can now formally define the recognition of a shape in an image. 

Definition 1 (Recognition) A shape description C is recognized in an image I if for 
every interpretation {I, A) such that I e A, it is I e . An interpretation (J, A) satisfies 
a composite shape description C if there exists an image I ^ A such that C is recognized in 
I. A composite shape description is satisfiable if there exists an interpretation satisfying it. 

Observe that shape descriptions could be unsatisfiable: if two components define overlapping 
regions, no image can be segmented in a way that satisfies both components. Of course, if 
composite shape descriptions are built using a graphical tool, unsatisfiability can be easily 
avoided, so we assume that descriptions are always satisfiable. Anyway, unsatisfiable shape 
descriptions could be easily detected, from their syntactic form, since unsatisfiability can 
only arise because of overlapping regions (see Proposition 4). 

Observe also that our set-based semantics implies the intuitive interpretation of con- 
junction "n" — one could easily prove that □ is commutative and idempotent. 

For approximate matching, we modify definition (2), following the fuzzy interpretation 
of n as minimum, and existential as maximum: 

C^{I) = max {mln {((r o r,), B,f{I)}} (3) 

T 1=1 

Observe that our interpretation of composite shape descriptions strictly requires the pres- 
ence of all components. In fact, the measure by which an image / belongs to the interpreta- 
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tion of a composite shape description is dominated by the least similar shape component 
(the one with the minimum similarity) . Hence, if a basic shape component is very dissimilar 
from every region in J, this brings near to 0^ also the measure of C^{I). This is more strict 
than, e.g., Gudivada & Raghavan's (1995) or El-Kwae k, Kabuka's (1999) approaches, in 
which a non-appearing component can decrease the similarity value of C^{I), but I can be 
still above a threshold. 

Although this requirement may seem a strict one, it captures the way details are used 
to refine a query: the "dominant" shapes are used first, and, if the retrieved set is still too 
large, the user adds details to restrict the results. In this refinement process, it should not 
happen that other images that match only some new details, "pop up" enlarging the set of 
results that the user was trying to restrict. We formalize this refinement process through 
the following definition. 

Proposition 1 (Downward refinement) Let C he a composite shape description, and 

let D he a refinement of C, that is D = C n (r', B'). For every interpretation I, if shapes 
are interpreted as in (2), then C ; if shapes are interpreted as in (3), then for every 
image I it holds D^{I) < C^{I). 

Proof. For (2), the claim follows from the fact that D-^ considers an intersection of the same 
components as the one of C-^, plus the set ((r o t'),B')^. For (3), the claim analogously 
follows from the fact that D^{I) computes a minimum over a superset of the values consid- 
ered for C^(/). □ 

The above property makes our language fully compositional. Namely, let C be a com- 
posite shape description; we can consider the meaning of C — when used as a query — as 
the set of images that can be potentially retrieved using C. At least, this will be the mean- 
ing perceived by an end user of a system. Downward refinement ensures that the meaning 
of C can be obtained by starting with one component, and then progressively adding other 
components in any order. We remark that for other frameworks cited above (Gudivada &; 
Raghavan, 1995; El-Kwae & Kabuka, 1999) this property does not hold. We illustrate the 
problem in Figure 3. Starting with shape description C, we may retrieve (among many 
others) the two images /i,/2, for which both C^{Ii) and C^{l2) are above a threshold t, 
while another image is not in the set because C^{Iz) < t. In order to be more selec- 
tive, we try adding details, and we obtain the shape description D. Using D, we may still 
retrieve I2, and discard Ii. However, /a now partially matches the new details of D. If 
Downward refinement holds, D-^{Is) < C^{Iz) < t, and cannot "pop up". In contrast, 
if Downward refinement does not hold (as in Gudivada & Raghavan's approach) it can be 
> t > C^{Iz) because matched details in D raise the similarity sum weighted over 
all components. In this case, the meaning of a sketch cannot be defined in terms of its 
components. 

Downward refinement is a property linking syntax to semantics. Thanks to the exten- 
sional semantics, it can be extended to an even more meaningful semantic relation, namely, 

3. Not exactly 0, since every shape matches every other one with a very low similarity measure. Similarity 
is often computed as the inverse of a distance. Similarity would correspond to infinite distance. 
Nevertheless, the recognition algorithm can force the similarity to when it is below a threshold. 
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subsumption. We borrow this definition from Description Logics (Donini, Lenzerini, Nardi, 
& Schaerf, 1996), and its fuzzy extensions (Yen, 1991; Straccia, 2001). 

Definition 2 (Subsumption) A description C subsumes a description D if for every 
interpretation!, C . If (3) is used, C subsumes D if for every interpretation I and 
image I e A, tt ts D^{I) < C^{I). 

Subsumption takes into account the fact that a description might contain a syntactic variant 
of another, without both the user and the system exphcitly knowing this fact. The notion 
of subsumption extends downward refinement. It enables also a hierarchy of shape descrip- 
tions, in which a description D is below another C if I? is subsumed by C. When C and 
D are used as queries, the subsumption hierarchy makes easy to detect query containment. 
Containment can be used to speed up retrieval: all images retrieved using D as a query can 
be immediately retrieved also when C is used as a query, without recomputing similarities. 
While query containment is important in standard databases (Ullman, 1988), it becomes 
even more important in an image retrieval setting, since the recognition of specific features 
in an image can be computationally demanding. 

Figure 4 illustrates an example of subsumption hierarchy of basic and composite shapes 
(thick arrows denote a subsumption between shapes), and two images in which shapes can 
be recognized (thin arrows). 

Although we did not consider a background, it could be added to our framework as a 
special basic component (c, t, background) with the property that a region b satisfies the 
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background simply if their colors and textures match, with no check on the edge contours. 
Also, more than one background could be added; in that case background regions should 
not overlap, and the matching of background regions should be considered after the regions 
of all the basic shapes recognized are subtracted to the background regions. 

4. Reasoning and Retrieval 

We envisage several reasoning services that can be carried out in a logic for image retrieval: 

1. shape recognition: Given an image / and a shape description D, decide if D is recog- 
nized in I. 

2. image retrieval: given a database of images and a shape description D, retrieve all 
images in which D can be recognized. 

3. image classification: given an image I and a collection of descriptions Di, . . . , Dn, find 
which descriptions can be recognized in /. In practice, / is classified by finding the 
most specific descriptions (with reference to subsumption) it satisfies. Observe that 
classification is a way of "preprocessing" recognition. 

4. description subsumption (and classification): given a (new) description D and a col- 
lection of descriptions Di, . . . , Dn, decide whether D subsumes (or is subsumed by) 
each Di, for z = 1, . . . , n. 

While services 1-2 are standard in an image retrieval system, services 3-4 are less obvious, 
and we briefly discuss them below. 

The process of image retrieval is quite expensive, and systems usually perform off-line 
processing of data, amortizing its cost over several queries to be answered on-line. As an 
example, all document retrieval systems for the web"^, both for images and text, use spiders 
to crawl the web and extract some relevant features [e.g., color distributions and textures 
in images, keywords in texts), that are used to classify documents. Then, the answering 
process uses such classified, extracted features of documents — and not the original data. 

Our system can adapt this setting to composite shapes, too. In our system, a new 
image inserted in the database is immediately segmented and classified in accordance with 
the basic shapes that compose it, and the composite descriptions it satisfies (Service 3). Also 
a query undergoes the same classification, with reference to the queries already answered 
(Service 4). The more basic shapes are present, the faster will the system answer new 
queries based on these shapes. 

More formally, given a query (shape description) D, if there exists a collection of de- 
scriptions Dn and all images in the database were already classified with reference 
to Dn, then it may suffice to classify D with reference to Di, . . . , Dn to find (most 
of) the images satisfying D. This is the usual way in which classification in Description 
Logics — which amounts to a semantic indexing — can help query answering (Nebel, 1990). 

For example, to answer the query asking for images containing an arch, a system may 
classify arch and find that it subsumes threePortalsGate (see Figure 4). Then, the system 

4. e.g., AltaVista, QBIC, NETRA, Blobworld, but also Yahoo (for textual documents). 
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can include in the answer all images in which ancient Roman gates can be recognized, 
without recomputing whether these images contain an arch or not. 

The problem of computing subsumption between descriptions is reduced to recognition 
in the next section, and then an algorithm for exact recognition is given. Then, we extend 
the algorithm to realistic approximate recognition, reconsidering color and texture. 



4.1 Exact Reasoning on Images and Descriptions 

We start with a reformulation of (2), more suited for computational purposes. 

Theorem 2 (Recognition as mapping) Let C = (ti, Bi) fl ■ • • □ {t„, Bn) he a composite 
shape description, and let I he an image, segmented into regions {ri, . . . ,rm}. Then C is 
recognized in I iff there exists a transformation r and an injective mapping j : {1, . . . , n} — > 
{!,..., m} such that for i = 1, . . . ,n it is 

e{r,(^)) = T{Ti{e{Bi))) 

Proof. From (2), C is recognized in / iff 

n n 

3t[I e Pi ((r o Ti),Bi)-'-] which is equivalent to 3t[/\ / G ((t o Tj), Sj)-^] 
i=i i=i 

Expanding ((r o Xj), Bi)^ with its definition (1) yields 

n 

3r[/\ 3r e I.e{r) = T{n{e{Bi)))] 

i=l 

and since regions in I are {ri, . . . ,rm} this is equivalent to 

n m 

3r[/\\/ e{rj)=r{nie{Bm 
i=ii=i 

Making explicit the disjunction over j and conjunctions over i, we can arrange this con- 
junctive formula as a matrix: 



3t 



(e(ri) =r(Ti(e(Bi))) V •■• V e(r„) = r(Ti(e(Si)))) ) A 

: V : V : A 

(e(ri) = T(T„(e(B„))) V ••• V e(r„,) = T(T„(e(B„)))) ) 



(4) 



Now we note two properties in the above matrix of equalities: 

1. For a given transformation, at most one region among ri, . . . ,rm can be equal to each 
component. This means that in each row, at most one disjunct can be true for a given 
r. 

2. For a given transformation, a region can match at most one component. This means 
that in each column, at most one equality can be true for a given r. 
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We observe that these properties do not imply that regions have all different shapes, since 
the equality of contours depends on any translation, rotation, and scaling. We use equality 
to represent true overlap, and not just equal shape. 

Properties 1-2 imply that the above formula is true iff there is an injective function 
mapping each component to one region it matches with. To ease the comparison with the 
formulae above we use the same symbol j as a mapping j : {1, . . . , n} — > {1, . . . , m}. Hence, 
Formula (4) can be rewritten into the claim: 

n 

3T[3j : {l..n} ^ {l..m} /\ e(r,.(,)) = T{n{e{Bim (5) 

i=l 

□ 

Hence, even if in the previous section the semantics of a composite shape was derived from 
the semantics of its components, in computing whether an image contains a composite shape 
one can focus on groups of regions, one group rj(i), . . . , rj(„) for each possible mapping j. 

Observe that j injective implies m > n, as one would expect. The above proposition 
leaves open which one between r or j must be chosen first. In fact, in what follows we 
show that the optimal choice for exact recognition is to mix decisions about j and r. When 
approximate recognition will be considered, however, exchanging quantifiers is not harmless. 
In fact, it can change the order in which approximations are made. We return to this 
issue in the next section, when we discuss how one can devise algorithms for approximate 
recognition. 

Subsumption in this simple logic for shape descriptions relies on the composition of 
contours of basic shapes. Intuitively, to actually decide if D is subsumed by C, we check 
if the sketch associated with D — seen as an image — would be retrieved using C as a 
query. Prom a logical perspective, the existentially quantified regions in the semantics of 
shape descriptions (1) are skolemized with their prototypical contours. Formal definitions 
follow. 

Definition 3 (Prototypical image) Let B be a basic shape. Its prototypical image is 
I{B) = {e{B)}. Let C = {ti,Bi) fl ■•• □ {Tn,Bn) be a composite shape description. Its 
prototypical image is I{C) = {ri(e(Bi)), . . . , Tn(e(B„))}. 

In practice, from a composite shape description one builds its prototypical image just ap- 
plying the stated transformations to its components (and color/texture fillings, if present). 
Recall that we envisage this prototypical image to be built directly by the user, with the 
help of a drawing tool, with basic shapes and colors as palette items. The system will 
just keep track of the transformations corresponding to the user's actions, and use them in 
building the (internal) shape descriptions stored with the previous syntax. The feature that 
makes our proposal different from other query-by-sketch retrieval systems, is precisely that 
our sketches have also a logical meaning. So, properties about description/sketches can be 
proved, containment between query sketches can be stated in a formal way, and algorithms 
for containment checking can be proved correct with reference to the semantics. 

Prototypical images have some important properties. The first is that they satisfy (in 
the sense of Definition 1) the shape description they exemplify — as intuition would suggest. 
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Proposition 3 For every composite shape description D, if D is satisfiable then the inter- 
pretation (X, {/(D)}) satisfies D. 

Proof. From Theorem 2, using an identical transformation r and the identity mapping for 



A shape description D is satisfiable if there are no overlapping regions in I{D). Since 
this is obvious when D is specified by a drawing tool, we just give the following proposition 
for sake of completeness. 

Proposition 4 A shape description D is satisfiable iff its prototypical image I{D) contains 
no overlapping regions. 

We now turn to subsumption. Observe that if Bi and B2 are basic shapes, either they 
are equivalent (each one subsumes the other) or neither of the two subsumes the other. If we 
adopt for the segmented regions an invariant representation, (e.g. Fourier transforms of the 
contour) deciding equivalence between basic shapes, or recognizing whether a basic shape 
appears in an image, is just a call to an algorithm computing the similarity between shapes. 
This is what usual image recognizers do — allowing for some tolerance in the matching 
of the shapes. Therefore, our framework extends the retrieval of shapes made of a single 
component, for which effective systems are already available. 

We now consider composite shape descriptions, and prove the main property of pro- 
totypical images, namely, the fact that subsumption between shape descriptions can be 
decided by checking if the subsumer can be recognized in the sketch of the subsumee. 

Theorem 5 A composite shape description C subsumes a description D if and only if C 

is recognized in the prototypical image I{D). 

Proof. Let C = (n, 5i) n • • ■ n {t„. En), and let D = (cri, ^1) n • • ■ n {am, Am). Recall that 
I{D) is defined by I{D) = {ai{e{Ai)), . . . , crm{e{Am))}. To ease the reading, we sketch the 
idea of the proof in Figure 5. 

If. Suppose C is recognized in I{D), that is, I{D) E for every interpretation (T, A) 
such that I{D) e A. Then, from Theorem 2 there exists a transformation f and a suitable 
injective function j from {1, . . . , n} into {1, . . . , m} such that 

e(rj(fc)) = f o Tk{e{Bk)) for A; = 1, . . . , n 

Since I{D) is the prototypical image of D, we can substitute each region with the basic 
shape of D it comes from: 

f^j(fe)(e(-4j(fc))) =f 0Tfe(e(Sfc)) for A; = l,...,n (6) 

Now suppose that D is recognized in an image J = {.si, . . . ,Sp}, with J G A. We prove that 
also C is recognized in J. In fact, if D is recognized in ./ then there exists a transformation 
a and another injective mapping q from {1, . . . , m} into {1, . . . ,p} selecting from J regions 
{sg(i), . . . , Sq(„)} such that 

e{sq{h)) = 0-0 ah{e{Ah)) for /i = 1, . . . , m (7) 
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(prototypical image of) C 




image J 



Figure 5: A sketch of the If-proof of Theorem 5 

Now composing q and j — that is, selecting the regions of J satisfying those components 
of D which are used to recognize C — one obtains 

e(sg(i(fc))) = ^o<^j(fc)(e(^j(fe))) for A; = l,...,n (8) 

Then, substituting equals for equals from (6), one finally gets 

e(sg(j(fc))) = CT o f o Tk{e{Bk)) for A; = 1, . . . , n 

which proves that C too is recognized in J, using (t o f as transformation of its components, 
and q{j{-)) as injective mapping from {1, . . . , n} into {1, . . . , 'p}- Since J is a generic image, 
it follows that C C^. Since (J, A) is generic too, C subsumes D. 

Only if. The reverse direction is easier: suppose C subsumes D. By definition, this 
amounts to D-^ C C-^ for every collection of images X. For every X that contains I{D), then 
I{D) e D-^ for Proposition 3. Therefore, I{D) e C-^, that is, C is recognized in I{D). q 

This property allows us to compute subsumption as recognition, so we concentrate on 
complex shape recognition, using Theorem 2. Our concern is how to decide whether there 
exists a transformation r and a matching j having the properties stated in Theorem 2. 
It turns out that for exact recognition, a quadratic upper bound can be attained for the 
possible transformations to try. 
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Theorem 6 Let C = {ti,Bi) □ • ■ • □ {Tn,Bn) be a composite shape description, and let I 
be an image, segmented into regions {ri, . . . ,rm}- Then, there are at most m{m — 1) exact 
matches between the n basic shapes and the m regions. Moreover, each possible match can 
be verified by checking the matching of n pairs of contours. 

Proof. A transformation r matching exactly basic components to regions is also an 
exact match for their centroids. Hence we concentrate on centroids. Each correspondence 
between a centroid of a basic component and a centroid of a region yields two constraints 
for r. Now r is a rigid motion with scaling, hence it has four degrees of freedom (two 
degrees for translations, one for rotation, and one for uniform scaling). Hence, if an exact 
match r exists between the centroids of the basic components and the centroids of some of 
the regions, then r is completely determined by the transformation of any two centroids of 
the basic shapes into two centroids of the regions. 

Fixing any pair of basic components Bi,i?2, let p^, P2 denote their centroids. Also, 
let ''j(i)) ^j(2) be the regions that correspond to Bi,B2, and let Vjj-i), Vj(2); denote their 
centroids. There is only one transformation r solving the point equations (each one mapping 
a point into another) 

^(n(Pi)) = Vj(i) 

Hence, there are only m{m — 1) such transformations. For the second claim, once a r 

matching the centroids is found, one checks that the edge contours of basic components 
and regions coincide, i.e., that T(ri(e(i?i))) = e(rj(i)), T(r2(e(i?2))) = e(rj(2)), and for 
k = 3, . . . ,n that T{Tk{e{Bk)) coincides with the contour of some region e{rj^j^^). 

Recalling Formula (5) in the proof of Theorem 2, this means that we can eliminate the 
outer quantifier in (5) using a computed r, and conclude that C is recognized in I iff: 

n 

3j : {l..n} ^ {l..m} /\ e(r,(,)) = T{n{eiB,))) 

i=l 

□ 

Observe that, to prune the above search, once a r has been found as above, one can 
check for /c = 3, . . . , n that T{Tk{centr{Bk))) coincides with a centroid of some region Vj, 
before checking contours. 

Based on Theorem 6, we can devise the following algorithm: 

Algorithm Recognize (C,/); 

input a composite shape description C = {ti,Bi) □ • • • □ (t„, Bn), and 

an image /, segmented into regions ri, . . . ,rm 
output True if C is recognized in /, False otherwise 
begin 

(1) compute the centroids vi, . . . of ri, . . . ,rm 

(2) compute the centroids Pi, . . . ,p„ of the components of C 

(3) for i, /i e {1, . . . , m} with i < h do 

compute the transformation r such that t(pi) = Vj and t(p2) = v/^; 
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if for every fee {1, . . . , n} 

T(Tfe(e(i?fc))) coincides (for some j) with a region rj in / 

then return True 
endfor 
return False 

end 

The correctness of Recognize (C,/) foUows directly from Theorems 2 and 6. Regarding 
the time complexity, step (1) requires to compute centroids of segmented regions. Several 
methods for computing centroids are well known in the literature (Jahne, Haubecker, & 
Geibler, 1999). Hence, we abstract from this detail, and assinne there exists a function 
f{Nh,Ny) that bounds the complexity of computing one centroid, where Nfi,Ny are the 
horizontal and vertical dimensions of / (number of pixels). We report in the Appendix how 
we compute centroids, and concentrate on the complexity in terms of n, m, and f{Nh, Ny). 

Theorem 7 Let C = (n, n ■ ■ • n (t„, B^) be a composite shape description, and let I be 
an image with x pixels, segmented into regions {ri, . . . ,rm}- Moreover, let f{Nh, Ny) 
be a function bounding the complexity of computing the centroid of one region. Then C can 
be recognized in I in time 0{m ■ f{Nh, Ny) + n + • n ■ ■ Ny). 

Proof. From the assumptions. Step (1) can be performed in time 0{m- f{Nh, Ny)). Instead, 
Step (2) can be accomplished by extracting the n translation vectors from the transforma- 
tions Ti,...,T„ of the components of C. Therefore, it requires 0{n) time. Finally, the 
innermost check in Step (3) — checking whether a transformed basic shape and a region 
coincide — can be performed in 0{Nh ■ Ny), using a suitable marking of pixels in I with 
the region they belong to. Hence, we obtain the claim. |— I 

Since subsumption between two shape descriptions C and D can be reduced to recogniz- 
ing C in I{D), the same upper bound holds for checking subsumption between composite 
shape descriptions, with the simplification that also Step (1) can be accomplished without 
any further feature-level image processing. 

4.2 Approximate Recognition 

The algorithm proposed in the previous section assumes an exact recognition. Since the 
target of retrieval are real images, approximate recognition is needed. We start by re- 
considering the proof of Theorem 2, and in particular the matrix of equalities (4). Using 
the semantics for approximate recognition (3), the expanded formula for evaluating C^{I) 
becomes now the following: 



Now Properties 1-2 stated for exact recognition can be reformulated as hypotheses about 
sim, as follows. 



max mm 



r 




229 



Dl SCIASCIO, DONINI & MONGIELLO 



1. For a given transformation, we assume that at most one region among ri, . . . ,r„i is 
maximally similar to each component. This assumption can be justified by supposing 
its negation: if there are two regions both maximally similar to a component, then 
this maximal value should be a very low one, lowering the overall value because of the 
external minimization. This means that in maximizing each row, we can assume that 
the maximal value is given by one index among 1, . . . , m. 

2. For a given transformation, we assume that a region can yield a maximal similarity for 
at most one component. Again, the rationale of this assumption is that when a region 
yields a maximal similarity with two components in two different rows, this value can 
be only a low one, which propagates along the overall minimum. This means that in 
minimizing the maxima from all rows, we can consider a different region in each row. 

We remark that also in the approximate case these assumptions do not imply that regions 
have all different shapes, since sim is a similarity measure which is 1 only for true overlap, 
not just for equal shapes with different pose. The assumptions just state that sim should 
be a function "near" to plain equality. 

The above assumptions imply that we can focus on injective mappings from {l..n} into 
{l..m} also for the approximate recognition, yielding the formula 

n 

max max mm{sim{e{rjM),T{Ti{e{Bi))))} 

■'■ j:{l..n}— >{l..m} i=l 

The choices of r and j for the two maxima are independent, hence we can consider groups 
of regions first: 

n 

rnax maxrnin{sim(e(r (i)), r(Ti(e(5i))))} (9) 

Differently from the exact recognition, the choice of an injective mapping j does not directly 
lead to a transformation r, since now r depends on how the similarity of transformed shapes 
is computed, that is, the choice of r depends on sim. 

In giving a definition of sim, we reconsider the other image features (color, texture) that 
were skipped in the theoretical part to ease the presentation of semantics. This will intro- 
duce weighted sums in the similarity measure, where weights are set by the user according 
to the importance of the features in the recognition. 

Let sim{r, {c,t,T,B)) be a similarity measure that takes a region r (with its color c(r) 
and texture t{r)) and a component (c, t, r, B) into the range [0, 1] of real numbers (where 1 
is perfect matching). We note that color and texture similarities do not depend on trans- 
formations, hence their introduction does not change Assumptions 1-2 above. Accordingly, 
Formula (9) becomes 

n 

max maxmm{sim{rja),{c,t,{TOTi),Bi))} (10) 

j:{l..n}^{l..m} r i=l ' 

This formula suggests that from all the groups of regions in an image that might resemble 

the components, we should select the groups that present the higher similarity. In artificially 
constructed examples in which all shapes in / and C resemble each other, this may generate 
an exponential number of groups to be tested. However, we can assume that in realistic 
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images the similarity between shapes is selective enough to yield only a very small number 
of possible groups to try. We recall that in Gudivada's approach (Gudivada, 1998) an 
even stricter assumption is made, namely, each basic component in C does not appear 
twice, and each region in I matches at most one component in C. Hence our approach 
extends Gudivada's one, also for this aspect — besides the fact that we consider shape, 
scale, rotation, color and texture of each component. 

In spite of the assumptions made, finding an algorithm for computing the "best" r 
in Formula (10) proved for us a difficult task. The problem is that there is a continuous 
spectrum of r to be searched, and that the best r may not be unique. We observed that 
when only single points are to be matched — instead of regions and components — our 
problem simplifies to Point Pattern Matching in Computational Geometry. However, even 
recent results in that research area are not complete, and cannot be directly applied to 
our problem. Cardoze and Schulman (1998) solve the nearly-exact point matching with 
efficient randomized methods, but without scaling. They also observe that best match is 
a more difficult problem than nearly-exact match. Also Chew, Goodrich, Huttenlocher, 
Kedem, Kleinberg, and Kravets (1997) propose a method for best match of shapes, but 
they analyze only rigid motions without scaling. 

Therefore, we adopt some heuristics to evaluate the above formula. First of all, we 
decompose sim{r, {c,t,T,B)) as a sum of six weighted contributions. 

Three contributions are independent of the pose: color, texture and shape. The val- 
ues of color and texture similarity are denoted by simcoior{c{r),c) and simtexture{t{r),t), 
respectively. Similarity of the shapes (rotation-translation-scale invariant) is denoted by 
simghape{G{r),e{B)). For each feature, and each pair (region, component) we compute a 
similarity measure as explained in the Appendix. Then, we assign to all similarities of a 
feature — say, color — the worst similarity in the group. This yields a pessimistic estimate 
of Formula (10); however, for such estimate the Downward Refinement property holds (see 
next Theorem 8). 

The other three contributions depend on the pose, and try to evaluate how the pose 
of each region in the selected group is similar to the pose specified by the corresponding 
component in the sketch. In particular, simscaiei&{T) , T{e{B)) represents how similar in scale 
are the region and the transformed component, while simrotation{e{r) , T{e{B)) denotes how 
e(r) and T{e{B) are similarly (or not) rotated with reference to the arrangement of the 
other components. Finally, simspatiai{'^iT)^ '''(^(-S)) denotes a measure of how coincident are 
the centroids of the region and the transformed component. 

In summary, we get the following form for the overall similarity between a region and a 
component: 

sim{r, (c, t, r, B)) = simspaUai{e{r),T{e{B)) ■ a + 
simshape{e{r)^e{B)) ■ (3 + 
simcoior{c{r), c) • 7 + 
sirrirotation{e{r),T{e{B)) ■ 5 + 
(e(r),T(e(i?))-r? + 

Simtexture{t{r),t) ■ e 
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where coefficients a, /3, 7, 5, rj, e weight the relevance each feature has in the overall similarity 
computation. Obviously, we impose a + f3 + 'y + 6 + r] + e = 1, and all coefficients are greater 
or equal to 0. The actual values given to these coefficients in the implemented system are 
reported in Table 2 in Section 6. 

Because of the difficulties in computing the best r, we do not compute a maximum over 
all possible r's. Instead, we evaluate whether there can be a rigid transformation with scaling 
from Ti(e(5i)), . . . , Tn{e{Bn)) into rj(i), . . . , r^j-^), through similarities 
and sinirotaUon- There is a transformation iff all these similarities are f . If not, the lower 
the similarities are, the less "rigid" the transformation should be to match components and 
regions. Hence, instead of Formula (10) we evaluate the following simpler formula: 

n 

max mm{sim{rj^i),{c,t,Ti,Bi))} (11) 

j:{l..n}^{l..m} i=l 

interpreting pose similarities in a different way. We now describe in detail how we estimate 
pose similarities. 

Let C = (ci, ti, Ti, Bi)) n • • • n (cn, tn, Tn, Bn)), and let j be an injective function from 
{l..n} into {l..m}, that matches components with regions {rj{i), ■ ■ ■ 5''j(n)} respectively. 

4.2.1 Spatial Similarity 

For a given component — say, component 1 — we compute all angles under which the other 
components are seen from 1. Formally, let arj^ be the counter-clockwise-oriented angle with 
vertex in the centroid of component 1, and formed by the lines linking this centroid with 
the centroids of component i and h. There are n(n — l)/2 such angles. 

Then, we compute the correspondent angles for region namely, angles f^j(j^^J(J)j(^h) 
with vertex in the centroid of formed by the lines linking this centroid with the 

centroids of regions rj(j) and rj(;i) respectively. A pictorial representation of the angles is 
given in Figure 6. 

Then we let the difference AspaUal{&{rj{i)) , Tii^i^i)) be the maximal absolute difference 
between correspondent angles: 

A.,„,,„Ke(r,(i)),n(e(Si)) = ^ ^jnax " f^mmm^'' 

We compute an analogous measure for components 2,. . . ,n, and then we select the maximum 
of such differences: 

^spatialij] = y^^{Aspatial{e{rj(^i^),Ti{e{Bi))} (12) 

where the argument j highlights the fact that this measure depends on the mapping 
j. Finally, we transform this maximal difference — for which perfect matching yields 
— into a minimal similarity — perfect matching yields 1 — with the help of the func- 
tion $ described in the Appendix. This minimal similarity is then assigned to every 
simspatial{e{rj(^i)), Ti{e{Bi)), for i = 1, . . . , n. 

Intuitively, our estimate measures the difference in the arrangement of centroids between 
the composite shape and the group of regions. If there exists a transformation bringing 
components into regions exactly, every difference is 0, and so sirUspatial raises to 1 for every 
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Figure 6: Representation of angles used for computing spatial similarity of component 1 
and region 



233 



Dl SCIASCIO, DONINI & MONGIELLO 




Figure 7: Representation of angles used for computing rotation similarity of component 1 
and region Vjf^iy 



component. The more an arrangement is scattered with reference to the other arrangement, 
the higher its maximum difference. The reason why we use the maximum of all differences 
as similarity for each pair component-region will be clear when we prove later that this 
measure obeys Downward Refinement property. 

4.2.2 Rotation Similarity 

For every basic shape one can imagine a unit vector with origin in its centroid and oriented 

horizontally on the right (as seen on the palette). When the shape is used as a component 
— say, component 1 — also this vector is rotated according to ri. Let h denote such a 

rotated vector. For i = 2, . . . , n let 7-^ the counter-clockwise-oriented angle with vertex in 

ilh 

the centroid of component 1, and formed by h and the line linking the centroid of component 
1 with the centroid of component i. 

For region f'^(i)) the analogous u of h can be constructed by finding the rotation phase 
for which cross-correlation attains a maximum value (see Appendix). Then, for i = 2, . . . , n 
let 6 ,r~r-^^ be the angles with vertex in the centroid of r^fi), and formed by n and the line 

linking the centroid of rj(i) with the centroid of rj(i). Figure 7 clarifies the angles we are 
computing. 

Then we let the difference Arotation{e{'>'j(i)), Ti{e{Bi)) be the maximal absolute difference 
between correspondent angles: 

If there is more than one orientation of rj^i) for which cross-correlation yields a maximum — 
e.g., a square has four such orientations — then we compute the above maximal difference 
for all such orientations, and take the best difference (the minimal one). 
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Figure 8: Sizes and distances for scale similarity computation of component 1 and region 



We repeat the process for components 2 to n, and we select the maximum of such 
differences: 

'^rotationlj] = max{A™tQtion(e(r,(i)), Tj(e(5i))} (13) 

Finally, as for spatial similarity, we transform ^rotation [j] into a minimal similarity with 
the help of This minimal similarity is then assigned to every simrotation{&{rj[i)),Ti{e{Bi)), 
for i = 1, . . . , n. 

Observe that also these differences drop to when there is a perfect match, hence the 
similarity raises to 1. The more a region has to be rotated with reference to the other 
regions to match a component, the higher the rotational differences. Again, the fact that 
we use the worst difference to compute all rotational similarities will be exploited in the 
proof of Downward Refinement. 



4.2.3 Scale Similarity 

We concentrate again on component 1 to ease the presentation. Let mi be the size of 
component 1, computed as the mean distance between its centroid and points on the contour. 
Moreover, for i = 2, . . . , n, let du be the distance between the centroid of component 1 and 
the centroid of component i. In the image, let Mj(x) be the size of region rj(i), and let 



D 



be the distance between centroids of regions and Figure 8 pictures the 



quantities we are computing. 

We define the difference in scale between e(rj(i)) and Ti{e{Bi) as: 



Ascaie(e(rj(i)),Ti(e(Bi)) 



max 

i=2,...,r7 



iin{M,-(i)/i:ij(i)^(i), mi/dii} 



max{M,(i)/Dj(i)j(i), mi/(iii} 
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We repeat the process for components 2 to n, and we select the maximum of such differences: 

^scaleij] = max{Asca/e(e(rj(i)),ri(e(Bi))} (14) 

Finally, as for the other similarities, we transform Agca^b] into a minimal similarity with 
the help of This minimal similarity is then assigned to every simscalei^{rj{i)),Ti{e{Bi)), 
for i = 1, . . . , n. 

4.2.4 Discussion of Pose Similarities 

Using the same worst difference in evaluating pose similarities of all components may appear 
a somewhat drastic choice. However, we were guided in this choice by the goal of preserving 
the Downward Refinement property, even if we had to abandon the exact recognition of the 
previous section. 

Theorem 8 Let C be a composite shape description, and let D be a refinement of C, that 
is, D = Cn{c', t', t', B'). For every image I, segmented into regions ri, . . . ,rm, if C^{I) and 
D^{I) are computed as in (11) using similarities defined above, then it holds D^{I) < C^{I). 

Proof. Every injective function j used to map components of C into / can be extended to 
a function j' by letting j'{n +1) G {!,..., m} be a suitable region index not in the range 
of j. Since D^{I) is computed over such extended mappings, it is sufficient to show that 
values computed in Formula (11) do not increase with reference to the values computed for 
C. 

Let ji be the mapping for which the maximum value C^{I) is reached. Every extension 
j( of ji leads to a minimum value raio^^^ in Formula (11) which is lower than C-^{I). In fact, 

all pose differences (12), (13), (14), are computed as maximums over a strictly greater set 
of values, hence the pose similarities have either the same value, or a lower one. Regarding 
color, texture, and shape similarities, adding another component can only worsen the values 
for components of C, since we assign to all components the worst similarity in the group. 

Now consider another injective mapping j2 that yields a non-maximum value V2 < C^{I) 
in Formula (11). Using the above argument about pose differences (12), (13), (14), every 
extension j'2 leads to a minimum value v'2 <V2- Since V2 < C^{I), also every extension of 
every mapping j different from ji yields a value which is less than C-^(I). This completes 
the proof. 1^ 



5. A Prototype System 

In order to substantiate our ideas we have developed a prototype system, written in C++. 
The system is a client-server application working in a MS-Windows environment. 

The client side avails of a graphical user interface that allows one to carry out all the 
operations necessary to query the knowledge base, including a canvas for query by sketch 
composition using basic shapes and a module for query by example using new or existing 
images as queries. The client also allows a user to insert new shape descriptions and images 
in the knowledge base. The client has the logical structure shown in Figure 9. It is made 
up of three main modules: sketch, communication and configuration. 
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Figure 9: Architecture of the prototype system. 
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^ t!) 



Figure 10: The process of reclassification of images when a new description is inserted: a) 
before insertion of description (No. 9); b) after insertion. 



The communication module manages the communication with the server side, using 
a simple application-level protocol. The configuration module allows one to modify the 
parameters relative to the preview of images and shapes transferred from the server and 
placed in a cache managed with a FCFS policy for efhcient display. The sketch module 
allows a user to trace basic shapes as palette items, and properly insert and modify them 
by varying the scale and rotation factor. The available shapes may be basic ones such as 
ellipse, circle, rectangle, polygons or obtained by composing the basic shapes or complex 
shapes defined during previous sessions of the application and inserted in the knowledge 
base, but also shapes extracted from segmented images. 

The system keeps track of the transformations corresponding to the user's actions, and 
uses them in building the (internal) shape descriptions stored with the previously described 
syntax. The color and texture of the drawn shapes can be set according to the user require- 
ments, as the client interface provides a color palette and the possibility to open images in 
JPEG format with texture content. The user can also load images from the local disk and 
transmit them to the server to populate the knowledge base. Finally, the user can define 
new objects endowing them with a textual description and insert them into the knowledge 
base. 

The server side, which is also shown in Figure 9, is composed by concurrent threads 
that manage the server-side graphical interface, the connections and communications with 
the client applications and carry out the processing required by the client side. Obviously, 
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the server also carries out all tasks related to the insertion of images in the knowledge base, 
including segmentation, feature extraction and region indexing, and allows one to properly 
set the various parameters involved. To this end, the server has three main subcomponents: 

1. the image features extractor that contains an image segmentation module and a region 
data extraction one; 

2. the image classifier that is composed by a classifier module and a module used in the 
image reclassification; 

3. the database management system. 

The feature extractor segments and processes images to extract relevant features from 
each detected region, which characterize the images in the knowledge base. Image segmen- 
tation is carried out with an algorithm that starts with the extraction of relevant edges and 
then carries out a region growing procedure that basically merges smaller regions into larger 
ones according to their similarity in terms of color and texture. Detected regions obviously 
have to comply with some minimal heuristics. Each region has associated a description of 
the relevant features. 

The classifier manages a graph that is used to represent and hierarchically organizes 
shape descriptions: basic shapes, and more complex ones obtained by combining such el- 
ementary shapes and/or by applying transformations (rotation, scaling and translation). 
The basic shapes have no parents, so they are at the top of the hierarchy. Images, when 
inserted in the knowledge base after the segmentation process, are linked to the descriptions 
in the structure depending on the most specific descriptions that they are able to satisfy. 

The classifier module is invoked when a new description D has to be inserted in the 
system or a new query is posed. The classifier carries out a search process in the hierarchy 
to find the exact position where the new description D (a simple or a complex one) has to be 
inserted: the position is determined considering the descriptions that the new description 
is subsumed by. Once the position has been found, the image reclassifier compares D with 
the images available in the database to determine those that satisfy it; all the images that 
verify the recognition algorithm are tied to D. This stage only considers the images that 
are tied to descriptions that are direct ancestors of D, as outlined in Figure 10. 

As usual in Description Logics, also the query process consists of a description insertion, 
as both a query Q and a new description D are treated as prototypical images: a query 
Q to the system is considered a new description D and added to the hierarchical data 
struc ture; all images that are connected either to Q or to descriptions below the query in 
the hierarchical structure are returned as retrieved images. 

The database management module simply keeps track of images and/or pointers to 
images. 

Using the system is a straightforward task. After logon a user can draw a sketch on 
the canvas combining available basic shapes, and enrich the query with color and texture 
content. After that the query can be posed to the server to obtain images ranked according 
to their similarity. Figure 11 shows a query by sketch with two circles and the retrieved 
set. The system correctly retrieves pictures of cars in which the two circles are recognized 
in the same relative positions of the sketch and represent the wheels, but also a snow man 
with black buttons. 
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Figure 12: Downward refinement (contd.): A more detailed query, picturing a car, and the 
retrieved set of images. 
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Figure 13: A Subsumption example: increasing the number of objects in the query leads to 
a correct reduction in the retrieved set. 
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The introduction of more details restricts the retrieved set: adding a chassis to the 
previous sketch makes the query more precise, as well as the retrieval results, as it is shown 
in Figure 12. This example points out how we expect a user will use the system. He/she will 
start with a generic query with a few objects. If the number of images in the retrieved set is 
still too large, he/she will increase the number of details obtaining a downward refinement. 

Notice that the presence of regions/objects not included in the query is obviously ac- 
cepted but not the lack of a region that was explicitly introduced in the query. The idea 
underlying this approach is that there is an enormous amount of available images, and at 
the current stage of research and technology no system can always ensure a complete recog- 
nition; yet we believe that the focus should be on reducing false positives, accepting without 
much concern a higher ratio of false negatives. This basically means increasing precision, 
even at the cost of a possibly lower recall. In other words we believe it is preferable for a 
user looking for an image containing a yellow car, e.g., using the sketch in Figure 12, that 
he/she receives as result of the query a limited subset of images containing almost for sure 
a yellow car, than a large amount of images containing cars, but also several images with 
no cars at all. 

Subsumption is another distinguishing feature of our system. Figure 13 shows queries 
composed of basic shapes that have been obtained by segmentation of an image picturing 
aircrafts, i.e., the aircraft is now a basic shape for the system. Here, to better emphasize 
the example, only shape and position contribute to the similarity value. The process of 
subsumption is clearly highlighted: a query with just a single aircraft retrieves images with 
one aircraft, but also with more than one aircraft. Adding other aircrafts in the graphical 
query correctly reduces the retrieved set. The example also points out that the system is 
able to correctly deal with the presence of more than one instance of an object in images, 
which is not possible in the approaches by Gudivada and Raghavan (1995) and Gudivada 
(1998). On the negative side it has to be noticed that the system did not recognize the 
presence of a third aircraft (indeed a strange one, the B2-Spirit) in the second image of 
Figure 13-b) , which was not segmented at all and considered part of the background. 

The ability of the system to retrieve complex objects also in images with several other 
different objects, that is with no "main shapes", can be anyway seen in Figure 14. Here a 
real image is directly submitted as query. Notice that in this case the system has to carry 
out the segmentation process on the fly, and detect the composing shapes. 

6. Experiments and Results 

In order to assess the performance of the proposed approach and of the system implementing 
it, we have carried out an extensive set of experiments on a test dataset of images. It is 
well known that evaluating performances of an image retrieval system is difficult because 
of lack of ground truth measures. To ease the possibility of a comparison, we adopted the 
approach first proposed by Gudivada and Raghavan (1995). The experimental framework is 
hence largely based on the one proposed there, which relies on a comparison of the system 
performances versus the judgement of human experts. 

It should be noticed that in that work test images were iconic images, which were 
classified only in terms of spatial relationships between icons; in our experiments images 
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Figure 14: A query by example and retrieved images. 
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are real and classification has been carried out on all image features, including color, texture, 
shape, scale, orientation and spatial relationships. 

The test data set consists of a collection of 93 images; a sample of them is shown in 
Figure 15, while the complete set is available at URL: 

http://www-ictserv.poliba.it/disciascio/jair_images.htm. 

Images have been acquired using a digital camera, combining 18 objects, either simple 
objects {i.e., a single shape) or composite ones, of variable size and color. All images had 
size 1080 X 720 pixels, 24 bits/pixel. It should be noticed that actually there were more 
than 18 different objects, but we considered very similar variants of an object, e.g., two 
pens with a different color, as a single test object. 

We selected from the test data set 31 images to be used as queries. The query set formed 
two logical groupings. 

The first one (namely queries 1 through 15 and queries 27, 30 and 31) had as primary 
objective testing the performance of the system using as query single objects composed by 
various shapes. That is, assessing the ability of the system to detect and retrieve images 
containing the same object, or objects similar to the query. 

The query images in the second group (remaining images in the test data set) pictured 
two or more objects and they were chosen to assess the ability of the system to detect and 
retrieve images according to spatial relationships existing between the objects in the query. 

Obviously the difference between queries containing single objects composed by several 
shapes, and queries containing two or more objects, is just a cognitive one: for our system 
all queries are composite shapes. However, we observed that performances changed for the 
two groupings. 

We then separately asked five volunteers to classify in decreasing order, according to 
their judgment, the 93 images based on their similarity to each image of the selected query 
set. The volunteers had never used the system and they were only briefly instructed that 
rank orderings had to be based on the degree of conformance of the database images with 
the query images. They were allowed to group images when considered equivalent, and for 
each query, to discard images that were judged wholly dissimilar from the query. 

Having obtained five classifications, which were not univocal, we created the final ranking 
merging the previous similarity rankings according to a minimum ranking criterion. The 
final ranking of each image with respect to a query was determined as the minimum one 
among the five available. 

As an example consider the classification of Query nr.l, which is shown in Table 1. 
Notice that images grouped together in the same cell have been given the same relevance. 
Here Image 2 was ranked in third position by users 1,4, and 5, but as users 2 and 3 ranked 
it in fourth position, it was finally ranked in position four. Notice that for image 24 the 
same criterion leads to its withdrawal from ranked images. This approach limits the weight 
that images badly classified by single users have on the final ranking. 

Then we submitted the same set of 31 queries to the system, whose knowledge base was 
loaded only with the 93 images of the test set. 

The behavior of the system obviously depends on the configuration parameters, which 
determine the relevance of the various features involved in the similarity computation. The 
configuration parameters fed to the system were experimentally determined on a test bed of 
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user 


ranking 




1st 


2nd 


3rd 


4th 


6th 


1 


1 


44, 88 


2, 3, 68, 80 


26 


24 


2 


1 


44, 88 


3, 68, 80 


2, 26 




3 


1 


44, 88 


3, 68, 80 


2, 26 




4 


1 


44, 88 


2, 3, 68, 80 


26 


24 


5 


1 


44, 88 


2, 3, 68, 80 


24 26 




final 


1 


44, 88 


3, 68, 80 


2, 26 





Table 1: Users rankings for query nr.l 



Parameter 


Value 


Fourier descriptors threshold 


0.98 


Circular symmetry threshold 


0.99 


Spatial similarity threshold 
Syimnctry maxima threshold 


0.30 
0.10 


Spatial similarity weight a 
Spatial similarity sensitivity fx 
spatial similarity sensitivity fy 


0.30 
90.0 
0.40 


shape similarity weight /3 

shape similarity sensitivity fx 
shape similarity sensitivity fy 


0.30 
0.005 
0.20 


color similarity weight 7 
color similarity sensitivity fx 
color similarity sensitivity fy 


0.11 
110.0 
0.40 


rotation similarity weight 6 
rotation similarity sensitivity fx 
rotation similarity sensitivity fy 


0.11 
90.0 
0.40 


texture similarity weight e 
texture similarity sensitivity fx 
texture similarity sensitivity fy 


0.07 
110.0 
0.40 


scale similarity weight 7] 
scale similarity sensitivity fx 
scale similarity sensitivity fy 


0.11 
0.50 
0.40 


global similarity threshold 


0.70 



Table 2: Configuration parameters, grouped by feature type. 



approximately 500 images before starting tlie test phase. They are shown in Table 2. The 
parameters reported here are described in the Appendix. Notice that, dealing with well- 
defined objects, we gave an higher relevance to shape and spatial features and a reduced 
relevance to scale, rotation, color and texture. 

The resulting classification gave us what was called a system-provided ranking. We then 
adopted the Rnorm as quality measure of the retrieval effectiveness. Rnorm has been first 
introduced in the LIVE-Project (Bollmann, Jochum, Reiner, Weissmann, & Zuse, 1985) 
for the evaluation of textual information retrieval systems and it has been used in the 
experiments of the above referenced paper by Gudivada and Raghavan. To make the paper 
self-contained we recall here how Rnorm is defined. 

Let G be a finite set of images with a user-defined preference relation > that is complete 
and transitive. Let A"'*'' be the rank ordering of G induced by the user preference relation. 
Also, let A*^* be a system-provided ranking. The formulation of Rnorm is: 

^ J max 

where is the number of image pairs where a better image is ranked by the system 
ahead of a worse one; S~ is the number of pairs where a worse image is ranked ahead of a 
better one and S^^^^ is the maximum possible number of S'^ . It should be noticed that the 
calculation of , S~ , and S""*^ is based on the ranking of image pairs in A**^** relative to 
the ranking of corresponding image pairs in A"**". 
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Query nv. 


Image nv. 


Rnorm 


|1 


1 


0.92 


|2 


2 


0.92 


|3 


3 


0.93 


1 ^ 


4 


0.95 


1 ^ 


5 


0.99 


|6 


6 


0.94 


|7 


7 


0.93 


1 ° 


10 


0.93 


fg 


11 


0.95 


flO 


12 


0.74 


fll 


13 


0.60 


tl2 


14 


0.84 


tl3 


15 


0.83 


tl4 


18 


0.99 


tl5 


20 


0.91 


16 


25 


0.89 


17 


26 


0.80 


18 


27 


1.00 


19 


28 


0.74 


20 


31 


1.00 


21 


33 


1.00 


22 


34 


0.99 


23 


35 


0.91 


24 


36 


0.89 


25 


37 


1.00 


26 


39 


0.99 


t27 


41 


0.93 


28 


42 


0.98 


29 


50 


1.00 


t30 


78 


0.88 


t31 


79 


1.00 


Average Rnorm 




0.92 



Table 3: Rnorm values, (\indicates single-object queries) 



Rnorm values are in the range [0,1]; a value of 1 corresponds to a system-provided 
ordering of the database images that is either identical to the one provided by the human 
experts or has a higher degree of resolution, lower values correspond to a proportional 
disagreement between the two. 

Table 3 shows results for each query and the final average i?nor-m=0.92. Taking a closer 
look at results, for the first group of queries (single compound objects) the average value 
was Rnorm=0-90, and i?„orm=0.94 for the second grouping (various compound objects). 
(The complete set of result for users' ranking and system ranking is available in the online 
appendix) . 

As a comparison, the average Rnorm resulted 0.98 in the system presented by Gudivada 
and Raghavan (1995), where 24 iconic images were used both as queries and database 
images, and similarity was computed only on spatial relationships between icons. We remark 
here that our system works on real images and computes similarity on several image features, 
and we believe that results prove the ability of the system to catch to a good extent the 
users information need, and make refined distinctions between images when searching for 
composite shapes. Furthermore, our algorithm is able to correctly deal with the presence 
of more than one instance of an object in images, which is not possible in other approaches 
(Gudivada, 1998). It is also noteworthy that, though the parameters setting has been the 
object of several experiments, it cannot be considered optimal yet, and we believe that there 
is room for further improvement in the system performance, as it is also pointed out in the 
following paragraph. 

Obviously the system can fail when segmentation does not provide accurate enough 
results. Figure 16 shows results for Query 11, which was the one with the worst Rnorm- 
Here the system not only did not retrieve all images users had considered relevant, but 
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Search images [sketch) 
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Search objects [text) 



^ Add image 



92.839 % 





Figure 16: Query results for query 11, which had the lowest Rr, 



=0.60. 



more important wrongly confused the sugar-drop with a wrist-watch, which resulted in a 
false positive. As a matter of fact in various images the sweet-drops resulted not properly 
segmented. Nevertheless, highly relevant images were successfully retrieved and the wrongly 
retrieved one was slightly above the selection threshold. 

Another observation we made was that human users, when comparing a query with a 
single object, were much more driven by the color than any other feature, including the 
spatial positioning. This appeared in various queries and is again clearly visible using as 
example results for Query 11. Here users selected in the highest relevance class only images 
with the same color sugar-drop, and gave a lower ranking to images (with sugar-drops) with 
closer spatial relationships but different colors. This observation may be significant in the 
related field of object recognition. 

A final comment. With reference to the system behavior in terms of retrieval time, we 
did not carry out a systematic testing, as it depends on several variables: number of images 
in the database, number of objects in the query, but more important depth in the hierarchy 
- as the search time decreases as more basic shapes are available. Limiting our analysis 
to the database loaded with the 93 test images, the system required on average 12 sees to 
answer a query, on a machine with Celeron 400 MHz CPU and 128 MB RAM running both 
the client and the server. 
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7. Conclusion 

We proposed a Knowledge Representation approach to Image Retrieval. We started from 
the observation that current sketch-based image retrieval systems lack of a compositional 
query language — that is, they are not able to handle queries made by several shapes, where 
the position, orientation and size of the shapes relative to each other is meaningful. 

To recover this, we proposed a language to describe composite shapes, and gave an 
extensional semantics to queries, in terms of sets of retrieved images. To cope with a 
realistic setting from the beginning, we also generalized the semantics to fuzzy membership 
of an image to a description. The composition of shapes is made possible by the explicit 
use in our language of geometric transformations (translation- rotation-scale), which we 
borrow form hierarchical object modeling in Computer Graphics. We believe that this 
is a distinguishing feature of our approach, that significantly extends standard invariant 
recognition of single shapes in image retrieval. The extensional semantics allows us to 
properly define subsumption (i.e., containment) between queries. 

Borrowing also from Structured Knowledge Representation, and in particular from De- 
scription Logics, we stored shape descriptions in a subsumption hierarchy. The hierarchy 
provides a semantic index to the images in a database. The logical semantics allowed us 
to define other reasoning services: the recognition of a shape arrangement in an image, the 
classification of an image with reference to a hierarchy of descriptions, and subsumption 
between descriptions. These tasks are aside, but can speed up, the main one, which is Image 
Retrieval. 

We proved that subsumption in our simple logic can be reduced to recognition, and 
gave a polynomial-time algorithm to perform exact recognition. Then, for a realistic ap- 
plication of our setting we extended the algorithm to approximate recognition, weighting 
shape features (orientation, size, position), color and texture. 

Using our logical approach as a formal specification, we built a prototype system using 
state-of-the-art technology, and set up experiments both to assess the efficacy of our pro- 
posal, and to fine tune all parameters and weights that show up in approximate retrieval. 
The results of our experiments, although not exhaustive, show that our approach can catch 
to a good extent the users information need and make refined distinctions between images 
when searching for composite shapes. 

We believe that this proposal opens at least three directions for future research. First, 
the language for describing composite shapes could be enriched either with other logic- 
oriented connectives — e.g., alternative components corresponding to an OR in composi- 
tions — or to sequences of shape arrangements, to cope with objects with internal move- 
ments in video sequence retrieval. Second, techniques from Computational Geometry could 
be used to optimize the algorithms for approximate retrieval, while a study in the complex- 
ity of the recognition problem for composite shapes might prove the theoretical optimality 
of the algorithms. Finally, large-scale experiments might prove useful in understanding the 
relative importance attributed by end users to the various features of a composite shape. 
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Appendix A. 

In this appendix we briefly revise the methods we used for the extraction of image features. 
We also describe the smoothing function $ and the way we compute similarity for the image 
features that were introduced in Section 4.2. 

A.l Extraction of Image Features 

In order to deal with objects in an image, segmentation is required to obtain a partition 
of the image. Several segmentation algorithms have been proposed in the literature; our 
approach does not depend on the particular segmentation algorithm adopted. It is anyway 
obvious that the better the segmentation, the better our system will work. In our system 
we used a simple algorithm that merges edge detection and region growing. 

Illustration of this technique is beyond the scope of this paper; we limit here to the 
description of image features computation, which assume a successful segmentation. To 
make the description self-contained we start defining a generic color image as {7(a;, y) \ 1 < 
X < Nfi, i < y < Njj}, where N^, Ny are the horizontal and vertical dimensions, respectively, 
and / {x,y) is a three-components tuple {R,G,B). We assume that the image / has been 
partitioned in m regions (r^), i = 1, . . . ,m satisfying the following properties: 



• V ii € {1, 2, . . . , m}, Tj is a nonempty and connected set 

• n rj = iff i ^ j 

• each region satisfies heuristic and physical requirements. 

We characterize each region r, with the following attributes: shape, position, size, ori- 
entation, color and texture. 

Shape. Given a connected region a point moving along its boundary generates a complex 
function defined as: z{t) = x(t) + jy{t), t = 1, . . . ,Nf„ with Ni the number of boundary 
sample points. Following the approach proposed by Rui, She, and Huang (1996) we define 
the Discrete Fourier Transform (DFT) of z{t) as: 



= 1,2,...,! 



rn 




t=l 



with k = l,...,Nb. 
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In order to address the spatial discretization problem we compute the Fast Fourier 
Transform(FFT) of the boundary z{t); use the first {2Nc + 1) FFT coefficients to form a 
dense, non-uniform set of points of the boundary as: 

N 

Zdenseit) = ^ Z{k)e ' 
k=-Nc 

with t=l,.. .,Ndense- 

We then interpolate these samples to obtain uniformly spaced samples Zunif{t), t = 
0, . . . , Nunif- We compute again the FFT of Zunif{t) obtaining Fourier coeflficients Zunif{k), 
k = —Nc, . . . , Nc. The shape-feature of a region is hence characterized by a vector of 27Vc+l 
complex coefficients. 

Position and Size. Position is determined as the region centroid computed via moment 
invariants (Pratt, 1991). Size is computed as the mean distance between region centroid 
and points on the contour. 

Orientation. In order to quantify the orientation of each region we use the same 
Fourier representation, which stores the orientation information in the phase values. We 
obviously deal also with special cases when the shape of a region has more than one sym- 
metry, e.g., a rectangle or a circle. Rotational similarity between a reference shape B and 
a given region can then be obtained finding maximum values via cross-correlation: 

'^W = ^J^^TXT E ZB{k)Zr,{k) ■ e^'^^" with teO,...,2Nc 

Color. Color information of each region is stored, after quantization in a 112 values 
color space, as the mean RGB value within the region: 

pGr, peri p€ri 

Texture. We extract texture information for each region r-j with a method based on 
the work by Pok and Liu (1999). Following this approach, we extract texture features 
convolving the original grey level image I{x, y) with a bank of Gabor filters, having the 
following impulse response: 

h{x, y) = — .e-^^- e^-2-(£^-+^2/) 

where (IJ, V) represents the filter location in the freqiiency-domain, A is the central fre- 
quency, G is the scale factor, and d the orientation, defined as: 

\ = ^fWv\P■ = arctanf//y 

The processing allows to extract a 24-components feature vector, which characterizes 
each textured region. 



252 



Structured Knowledge Representation for Image Retrieval 



A. 2 Functions for Computing Similcu-ities 

Smoothing function In all similarity measures, we use the function ^{x, fx, fy). The 
role of this function is to change a distance x (in which corresponds to perfect matching) 
to a similarity measure (in which the value 1 corresponds to perfect matching), and to 
"smooth" the changes of the quantity x, depending on two parameters fx, fy. 



Hx, fx, fy) 



/y + (l-/2/)-cos(2^) ifO<x<fx 

-Mil 

ii X > fx 



arctanf^^liSi^iHi-M] 
^ fx-fy ' 



fy 

where fx>0 and < fy < 1. 

The input data to the approximate recognition algorithm are a shape description D, 
containing n components (cfc, tk, Tk, Bk) and an image / segmented into m regions ri, . . . jr^- 
The algorithm provides a measure for the approximate recognition of D in I. 

The first step of the algorithm in Section 4.2 considers all the m regions the image is 
segmented into and all the n components in the shape description D and finds — if any — all 
the groups of n regions rjf^k) satisfying the higher shape similarity with the shape components 
of D. To this purpose we compute shape similarity, based on the Fourier representation 
previously introduced, as vector of complex coefficients. Such measure denoted with sim,ss 
is invariant with respect to rotation, scale and translation and is computed as the cosine 
distance between the two vectors. The similarity gives a measure in the range [0,1] assuming 
the higher similarity siruss = 1 for perfect matching. 

Given the vectors X and Y of complex coefficients describing respectively the shape of 
a region and the shape of a component Bk, X = {xi, . . . , X2Nc) and Y = {yi, . . . , y2Nc) 

tn X T^i=ixiyi 
simss{Bk,ri) = 



Shape Similarity. The quantity sinishape measures the similarity between shapes in 
the composite shape description and the regions in the segmented image. 

sirUshape = [1 - simssiBk,rj(^k))], fXshape, fVshape) 

Color Similarity. The quantity sirricoior measures the similarity in terms of color 

appearance between the regions and the corresponding shapes in the composite shape de- 
scription. In the following formula, Acoiorik)-R denotes the difference in the red color 
component between the k-ih component of D and the region rj(yt), and similarly for the 
green and the blue color components. 



^color{k) color 

Then the function $ takes the maximum of the differences to obtain a similarity: 

sirUcolor = ^i^aMAcolorik)}, fXcolor, fVcolor) 
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Texture Similarity. Finally, sirritexture measures the similarity between the texture 
features in the components of D and in the corresponding regions. 

^texture{k) denotes the sum of differences in the texture components between the A;-th 
component of D and the region rj(fc) and dividing by the standard deviation of the elements. 

sirritexture — ^{^^^^ ^texturei^) j f Xtexturej fUtexture) 
k=l 
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